Abstract

In the framework of on-line learning, a learning machine may keep moving around a teacher because of
differences in structure or output function between the teacher and the learning machine, or because of
noise. The generalization performance of a new student supervised by such a moving machine is
analyzed. A model composed of a true teacher, a moving teacher and a student, all of which are linear
perceptrons with noise, is treated analytically using statistical mechanics. It is shown
that the generalization error of the student can be smaller than that of the moving teacher, even if the
student uses only examples from the moving teacher.
1 Introduction

Learning means inferring the underlying rules that govern data generation from observed data. The
observed data are input-output pairs from a teacher and are called examples. Learning can be roughly
classified into batch learning and on-line learning [?]. In batch learning, a given set of examples is used
repeatedly. In this paradigm, a student comes to give correct answers after training if it has adequate
degrees of freedom. However, batch learning requires a long training time and a large memory in which
to store many examples. In on-line learning, by contrast, each example is used once and then discarded.
In this case, a student cannot give correct answers for all the examples used in training. However,
on-line learning has some merits: for example, it does not need a large memory for storing many
examples, and it can follow a time-variant teacher.

Recently, we [?, 7] analyzed the generalization performance of ensemble learning [3, ?, 5] in the
framework of on-line learning using a statistical mechanical method. In that process, the following
points were also shown. The generalization error does not approach zero when the student is a simple
perceptron and the teacher is a committee machine or a non-monotonic perceptron [12]. Therefore,
models like these can be called unlearnable cases [?]. The behavior of a student in an unlearnable case
depends on the learning rule. With Hebbian learning, the student vector asymptotically converges in
one direction. With perceptron learning or AdaTron learning, by contrast, the student vector does not
converge in one direction but keeps moving. In the case of a non-monotonic teacher, the student's
behavior can be described as continuing to go around the teacher while keeping a constant direction
cosine with the teacher.

From the viewpoint of applying statistical learning theories, investigating the behavior of systems in
unlearnable cases is significant, since real-world problems seem to include many unlearnable cases.
In addition, as mentioned above, a learning machine may continue going around a teacher in an
unlearnable case. Here, let us consider a new student that is supervised by such a moving learning
machine. That is, we consider a student that uses the input-output pairs of a moving teacher as training
examples, and we investigate the generalization performance of the student with respect to the true
teacher. Note that the examples used by the student come only from the moving teacher; the student
cannot directly observe the outputs of the true teacher. In real human society, a teacher that a student
can observe does not always present the correct answer. In many cases, the teacher is itself learning
and continues to vary. Therefore, the analysis of such a model is interesting as an analogy between
statistical learning theories and real society.

In this paper, we treat a model in which a true teacher, a moving teacher and a student are all linear
perceptrons [?] with noise, as the simplest model in which a moving teacher continues going around
a true teacher. We calculate the order parameters and the generalization errors analytically using a
statistical mechanical method in the framework of on-line learning. As a result, it is shown that the
student's generalization error can be smaller than that of the moving teacher. That is, the student
can become cleverer than the moving teacher even though it uses only the examples of the moving
teacher.

2 Model 

Three linear perceptrons are treated in this paper: a true teacher, a moving teacher and a student. Their
connection weights are $A$, $B$ and $J$, respectively. For simplicity, the connection weights of the true
teacher, the moving teacher and the student are simply called the true teacher, the moving teacher,
and the student, respectively. The true teacher $A = (A_1, \ldots, A_N)$, the moving teacher
$B = (B_1, \ldots, B_N)$, the student $J = (J_1, \ldots, J_N)$, and the input $x = (x_1, \ldots, x_N)$ are
$N$-dimensional vectors. Each component $A_i$ of $A$ is drawn from $\mathcal{N}(0, 1)$ independently and
fixed, where $\mathcal{N}(0, 1)$ denotes the Gaussian distribution with a mean of zero and unit variance.
Each of the components $B_i^0$, $J_i^0$ of the initial values of $B$, $J$ is drawn from $\mathcal{N}(0, 1)$
independently. Each component $x_i$ of $x$ is drawn from $\mathcal{N}(0, 1/N)$ independently.
Thus,

$$\langle A_i \rangle = 0, \quad \langle (A_i)^2 \rangle = 1, \tag{1}$$
$$\langle B_i^0 \rangle = 0, \quad \langle (B_i^0)^2 \rangle = 1, \tag{2}$$
$$\langle J_i^0 \rangle = 0, \quad \langle (J_i^0)^2 \rangle = 1, \tag{3}$$
$$\langle x_i \rangle = 0, \quad \langle (x_i)^2 \rangle = \frac{1}{N}, \tag{4}$$

where $\langle \cdot \rangle$ denotes a mean.

In this paper, the thermodynamic limit $N \to \infty$ is also treated. Therefore,

$$\|A\| = \sqrt{N}, \quad \|B^0\| = \sqrt{N}, \quad \|J^0\| = \sqrt{N}, \quad \|x\| = 1, \tag{5}$$

where $\|\cdot\|$ denotes a vector norm. Generally, the norms $\|B\|$ and $\|J\|$ of the moving teacher and
the student change as the time step proceeds. Therefore, the ratios $l_B$ and $l_J$ of the norms to
$\sqrt{N}$ are introduced and are called the length of the moving teacher and the length of the student.
That is, $\|B\| = l_B \sqrt{N}$ and $\|J\| = l_J \sqrt{N}$.

The outputs of the true teacher, the moving teacher, and the student are $y^m + n_A^m$,
$v^m l_B^m + n_B^m$, and $u^m l_J^m + n_J^m$, respectively. Here,

$$y^m = A \cdot x^m, \tag{6}$$
$$v^m l_B^m = B^m \cdot x^m, \tag{7}$$
$$u^m l_J^m = J^m \cdot x^m, \tag{8}$$

and

$$n_A^m \sim \mathcal{N}(0, \sigma_A^2), \tag{9}$$
$$n_B^m \sim \mathcal{N}(0, \sigma_B^2), \tag{10}$$
$$n_J^m \sim \mathcal{N}(0, \sigma_J^2), \tag{11}$$

where $m$ denotes the time step. That is, the outputs of the true teacher, the moving teacher and the
student include independent Gaussian noise with variances of $\sigma_A^2$, $\sigma_B^2$, and $\sigma_J^2$,
respectively. Then, the $y^m$, $v^m$, and $u^m$ of Eqs. (6)-(8) obey Gaussian distributions with a mean of
zero and unit variance.
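The scalings in Eqs. (1)-(8) can be checked numerically for a finite $N$. The following is a minimal sketch, assuming NumPy; the helper name `sample_model`, the seed, and the sample sizes are illustrative choices and are not taken from the paper.

```python
import numpy as np

def sample_model(N, rng):
    """Draw A, B^0, J^0 with N(0, 1) components and one input x with N(0, 1/N) components."""
    A = rng.standard_normal(N)
    B0 = rng.standard_normal(N)
    J0 = rng.standard_normal(N)
    x = rng.standard_normal(N) / np.sqrt(N)
    return A, B0, J0, x

rng = np.random.default_rng(0)   # fixed seed, illustrative only
N = 2000                         # finite stand-in for the thermodynamic limit
A, B0, J0, x = sample_model(N, rng)

# Norms divided by sqrt(N): each ratio should be close to 1 for large N.
norm_ratios = [np.linalg.norm(w) / np.sqrt(N) for w in (A, B0, J0)]
x_norm = np.linalg.norm(x)       # should be close to 1

# y = A . x over many fresh inputs: mean close to 0, variance close to 1.
X = rng.standard_normal((2000, N)) / np.sqrt(N)
y = X @ A
y_mean, y_var = y.mean(), y.var()
```

For finite $N$ the deviations from the limiting values are of order $1/\sqrt{N}$, so the agreement is only approximate.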

In the model treated in this paper, the moving teacher $B$ is updated using an input $x$ and the output
of the true teacher $A$ for that input. The student $J$ is updated using an input $x$ and the output of
the moving teacher $B$ for that input. Let us define the error between the true teacher and the moving
teacher as the squared error of their outputs. That is,

$$\epsilon_B^m \equiv \frac{1}{2} \left( y^m + n_A^m - v^m l_B^m - n_B^m \right)^2. \tag{12}$$

The moving teacher is considered to use the gradient method for learning. That is,

$$B^{m+1} = B^m - \eta_B \frac{\partial \epsilon_B^m}{\partial B^m} \tag{13}$$
$$\phantom{B^{m+1}} = B^m + \eta_B \left( y^m + n_A^m - v^m l_B^m - n_B^m \right) x^m, \tag{14}$$

where $\eta_B$ denotes the learning rate of the moving teacher and is a constant.
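The step from Eq. (13) to Eq. (14) follows from the chain rule, since $B^m$ enters the error only through $v^m l_B^m = B^m \cdot x^m$. A short worked version, using only the definitions of this section:

```latex
\begin{align*}
\frac{\partial \epsilon_B^m}{\partial B^m}
  &= \left( y^m + n_A^m - v^m l_B^m - n_B^m \right)
     \frac{\partial}{\partial B^m} \left( - B^m \cdot x^m \right) \\
  &= - \left( y^m + n_A^m - v^m l_B^m - n_B^m \right) x^m ,
\end{align*}
% so that  B^{m+1} = B^m - \eta_B \, \partial \epsilon_B^m / \partial B^m
%                  = B^m + \eta_B ( y^m + n_A^m - v^m l_B^m - n_B^m ) x^m .
```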

In the same manner, let us define the error between the moving teacher and the student as the squared
error of their outputs. That is,

$$\epsilon_{BJ}^m \equiv \frac{1}{2} \left( v^m l_B^m + n_B^m - u^m l_J^m - n_J^m \right)^2. \tag{15}$$

The student is considered to use the gradient method for learning. That is,

$$J^{m+1} = J^m - \eta_J \frac{\partial \epsilon_{BJ}^m}{\partial J^m} \tag{16}$$
$$\phantom{J^{m+1}} = J^m + \eta_J \left( v^m l_B^m + n_B^m - u^m l_J^m - n_J^m \right) x^m, \tag{17}$$

where $\eta_J$ denotes the learning rate of the student and is a constant.
Generalizing the learning rules, Eqs. (14) and (17) can be expressed as

$$B^{m+1} = B^m + g\left( y^m + n_A^m,\; v^m l_B^m + n_B^m \right) x^m, \tag{18}$$
$$J^{m+1} = J^m + f\left( v^m l_B^m + n_B^m,\; u^m l_J^m + n_J^m \right) x^m, \tag{19}$$

respectively.
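The two update rules can be simulated directly. The following is a minimal sketch, assuming NumPy, the gradient-descent forms of $g$ and $f$ from Eqs. (14) and (17), and a continuous time $t = m/N$ (a conventional scaling assumed here). The function name `simulate`, the seed, and the small system size are illustrative and not the paper's simulation code.

```python
import numpy as np

def simulate(N=500, eta_B=1.0, eta_J=0.3, var_A=0.2, var_B=0.3, var_J=0.4,
             t_max=5.0, seed=0):
    """Run the on-line updates for t_max * N steps and return order parameters."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal(N)   # true teacher, fixed
    B = rng.standard_normal(N)   # moving teacher, initial value
    J = rng.standard_normal(N)   # student, initial value
    for _ in range(int(t_max * N)):
        x = rng.standard_normal(N) / np.sqrt(N)
        out_A = A @ x + rng.normal(0.0, np.sqrt(var_A))
        out_B = B @ x + rng.normal(0.0, np.sqrt(var_B))
        out_J = J @ x + rng.normal(0.0, np.sqrt(var_J))
        # The same noisy moving-teacher output appears in both errors,
        # and both updates use the weights from before this step.
        B_next = B + eta_B * (out_A - out_B) * x
        J = J + eta_J * (out_B - out_J) * x
        B = B_next
    return {
        "r_B": A @ B / N,
        "r_J": A @ J / N,
        "r_BJ": B @ J / N,
        "l_B": np.linalg.norm(B) / np.sqrt(N),
        "l_J": np.linalg.norm(J) / np.sqrt(N),
    }

result = simulate()
```

The student never sees `out_A`; it only sees the moving teacher's noisy output, which is the defining constraint of the model.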

Let us define the error between the true teacher and the student as the squared error of their outputs.
That is,

$$\epsilon_J^m \equiv \frac{1}{2} \left( y^m + n_A^m - u^m l_J^m - n_J^m \right)^2. \tag{20}$$

3 Theory 

3.1 Generalization Error 

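One purpose of statistical learning theory is to obtain generalization errors theoretically. Since a
generalization error is the mean of the error with respect to the true teacher over the distribution of
new inputs and noise, the generalization error $\epsilon_{Bg}$ of the moving teacher and $\epsilon_{Jg}$ of
the student are calculated as follows. The superscripts $m$, which represent the time steps, are omitted
for simplicity.

$$\epsilon_{Bg} = \int dx\, dn_A\, dn_B\, P(x, n_A, n_B)\, \epsilon_B \tag{21}$$
$$\phantom{\epsilon_{Bg}} = \int dy\, dv\, dn_A\, dn_B\, P(y, v, n_A, n_B)\, \frac{1}{2} \left( y + n_A - v l_B - n_B \right)^2 \tag{22}$$
$$\phantom{\epsilon_{Bg}} = \frac{1}{2} \left( -2 R_B l_B + (l_B)^2 + 1 + \sigma_A^2 + \sigma_B^2 \right), \tag{23}$$
$$\epsilon_{Jg} = \int dx\, dn_A\, dn_J\, P(x, n_A, n_J)\, \epsilon_J \tag{24}$$
$$\phantom{\epsilon_{Jg}} = \int dy\, du\, dn_A\, dn_J\, P(y, u, n_A, n_J)\, \frac{1}{2} \left( y + n_A - u l_J - n_J \right)^2 \tag{25}$$
$$\phantom{\epsilon_{Jg}} = \frac{1}{2} \left( -2 R_J l_J + (l_J)^2 + 1 + \sigma_A^2 + \sigma_J^2 \right). \tag{26}$$

In addition, let us calculate the mean $\epsilon_{BJg}$ of the error between the student and the moving
teacher as follows:

$$\epsilon_{BJg} = \int dx\, dn_B\, dn_J\, P(x, n_B, n_J)\, \epsilon_{BJ} \tag{27}$$
$$\phantom{\epsilon_{BJg}} = \int dv\, du\, dn_B\, dn_J\, P(v, u, n_B, n_J)\, \frac{1}{2} \left( v l_B + n_B - u l_J - n_J \right)^2 \tag{28}$$
$$\phantom{\epsilon_{BJg}} = \frac{1}{2} \left( -2 R_{BJ} l_B l_J + (l_J)^2 + (l_B)^2 + \sigma_B^2 + \sigma_J^2 \right). \tag{29}$$

Here, the integration has been executed using the following: $y$, $v$ and $u$ obey $\mathcal{N}(0, 1)$. The
covariance between $y$ and $v$ is $R_B$, that between $y$ and $u$ is $R_J$, and that between $v$ and $u$ is
$R_{BJ}$, where

$$R_B = \frac{A \cdot B}{\|A\| \|B\|}, \quad R_J = \frac{A \cdot J}{\|A\| \|J\|}, \quad R_{BJ} = \frac{B \cdot J}{\|B\| \|J\|}. \tag{30}$$

Eq. (30) means that $R_B$, $R_J$, and $R_{BJ}$ are direction cosines. $n_A$, $n_B$, and $n_J$ are all independent
of the other random variables. The true teacher $A$, the moving teacher $B$, the student $J$, and the
relationship among $R_B$, $R_J$, and $R_{BJ}$ are shown in Fig. 1.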
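Equations (23), (26), and (29) express the three errors purely in terms of the direction cosines, the lengths, and the noise variances, so they translate into a few lines of code. The following is a minimal sketch; the function and argument names are illustrative, and the noise variances used in the example are the ones quoted later for the figures.

```python
def eps_Bg(R_B, l_B, var_A, var_B):
    """Generalization error of the moving teacher, Eq. (23)."""
    return 0.5 * (-2.0 * R_B * l_B + l_B**2 + 1.0 + var_A + var_B)

def eps_Jg(R_J, l_J, var_A, var_J):
    """Generalization error of the student, Eq. (26)."""
    return 0.5 * (-2.0 * R_J * l_J + l_J**2 + 1.0 + var_A + var_J)

def eps_BJg(R_BJ, l_B, l_J, var_B, var_J):
    """Mean error between the moving teacher and the student, Eq. (29)."""
    return 0.5 * (-2.0 * R_BJ * l_B * l_J + l_J**2 + l_B**2 + var_B + var_J)

# Initial state: all direction cosines zero, both lengths one.
var_A, var_B, var_J = 0.2, 0.3, 0.4
e_B0 = eps_Bg(0.0, 1.0, var_A, var_B)
e_J0 = eps_Jg(0.0, 1.0, var_A, var_J)
e_BJ0 = eps_BJg(0.0, 1.0, 1.0, var_B, var_J)
```

Because the lengths enter quadratically, a student with a better direction cosine can still have a worse error if its length is far from unity.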

3.2 Differential equations of order parameters and their analytical solutions 

To make the analysis easy, the following auxiliary order parameters are introduced:

$$r_B \equiv R_B l_B, \tag{31}$$
$$r_J \equiv R_J l_J, \tag{32}$$
$$r_{BJ} \equiv R_{BJ} l_B l_J. \tag{33}$$

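Figure 1: True teacher $A$, moving teacher $B$ and student $J$. $R_B$, $R_J$, and $R_{BJ}$ are direction cosines.

Simultaneous differential equations in deterministic forms [?] have been obtained that describe the
dynamical behaviors of the order parameters, based on self-averaging in the thermodynamic limit, as
follows:

$$\frac{dr_B}{dt} = \langle g y \rangle, \tag{34}$$
$$\frac{dr_J}{dt} = \langle f y \rangle, \tag{35}$$
$$\frac{dr_{BJ}}{dt} = l_J \langle g u \rangle + l_B \langle f v \rangle + \langle g f \rangle, \tag{36}$$
$$\frac{dl_B}{dt} = \langle g v \rangle + \frac{\langle g^2 \rangle}{2 l_B}, \tag{37}$$
$$\frac{dl_J}{dt} = \langle f u \rangle + \frac{\langle f^2 \rangle}{2 l_J}. \tag{38}$$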
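As an illustration of how such equations arise, the one for $r_B$ can be obtained directly from Eq. (18). The sketch below assumes the usual continuous time $t = m/N$, which is not stated explicitly in this section, and uses $A \cdot B^m = N r_B^m$, which follows from Eqs. (5), (30), and (31).

```latex
\begin{align*}
N r_B^{m+1} = A \cdot B^{m+1}
  &= A \cdot B^m + g \, (A \cdot x^m) = N r_B^m + g \, y^m , \\
\frac{r_B^{m+1} - r_B^m}{1/N} &= g \, y^m
  \quad \xrightarrow{\; N \to \infty,\ \text{self-averaging} \;} \quad
  \frac{dr_B}{dt} = \langle g y \rangle .
\end{align*}
```

The equations for the lengths follow in the same way from $\|B^{m+1}\|^2 = \|B^m\|^2 + 2 g \, v^m l_B^m + g^2 \|x^m\|^2$ with $\|x^m\| = 1$.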

Since linear perceptrons are treated in this paper, the sample averages that appear in the above
equations can be calculated easily as follows:

$$\begin{aligned}
\langle g u \rangle &= \eta_B (r_J - r_{BJ}) / l_J, \\
\langle f v \rangle &= \eta_J (l_B - r_{BJ} / l_B), \\
\langle g f \rangle &= \eta_B \eta_J (r_B - r_J - l_B^2 + r_{BJ} - \sigma_B^2), \\
\langle f y \rangle &= \eta_J (r_B - r_J), \\
\langle g y \rangle &= \eta_B (1 - r_B), \\
\langle g v \rangle &= \eta_B (r_B / l_B - l_B), \\
\langle g^2 \rangle &= \eta_B^2 (1 + \sigma_A^2 + \sigma_B^2 + l_B^2 - 2 r_B), \\
\langle f u \rangle &= \eta_J (r_{BJ} / l_J - l_J), \\
\langle f^2 \rangle &= \eta_J^2 (l_B^2 + l_J^2 + \sigma_B^2 + \sigma_J^2 - 2 r_{BJ}).
\end{aligned} \tag{39--47}$$

Since each component of the true teacher $A$, the initial value of the moving teacher $B$, and the
initial value of the student $J$ is drawn from $\mathcal{N}(0, 1)$ independently and because the
thermodynamic limit $N \to \infty$ is also treated, they are all orthogonal to each other in the initial
state. That is,

$$R_B^0 = R_J^0 = R_{BJ}^0 = 0. \tag{48}$$

In addition,

$$l_B^0 = l_J^0 = 1. \tag{49}$$
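With the sample averages inserted, Eqs. (34)-(38) form a closed linear system that can also be integrated numerically as a cross-check. The following is a minimal sketch, assuming NumPy and a hand-written fourth-order Runge-Kutta step. It works with the squared lengths $l_B^2$ and $l_J^2$ (using $d(l^2)/dt = 2 l \, dl/dt$) to avoid dividing by $l$; the function names, step size, and default parameters are illustrative choices.

```python
import numpy as np

def rhs(s, eta_B, eta_J, var_A, var_B, var_J):
    """Right-hand sides for (r_B, r_J, r_BJ, l_B^2, l_J^2)."""
    r_B, r_J, r_BJ, L_B, L_J = s
    a, b = eta_B, eta_J
    return np.array([
        a * (1.0 - r_B),                                     # <g y>
        b * (r_B - r_J),                                     # <f y>
        a * (r_J - r_BJ) + b * (L_B - r_BJ)                  # l_J<g u> + l_B<f v>
        + a * b * (r_B - r_J - L_B + r_BJ - var_B),          # + <g f>
        2.0 * a * (r_B - L_B)                                # 2 l_B <g v>
        + a * a * (1.0 + var_A + var_B + L_B - 2.0 * r_B),   # + <g^2>
        2.0 * b * (r_BJ - L_J)                               # 2 l_J <f u>
        + b * b * (L_B + L_J + var_B + var_J - 2.0 * r_BJ),  # + <f^2>
    ])

def integrate(eta_B=1.0, eta_J=0.3, var_A=0.2, var_B=0.3, var_J=0.4,
              t_max=40.0, dt=0.01):
    """Integrate from the initial state of Eqs. (48)-(49) up to t_max."""
    s = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
    args = (eta_B, eta_J, var_A, var_B, var_J)
    for _ in range(int(round(t_max / dt))):
        k1 = rhs(s, *args)
        k2 = rhs(s + 0.5 * dt * k1, *args)
        k3 = rhs(s + 0.5 * dt * k2, *args)
        k4 = rhs(s + dt * k3, *args)
        s = s + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return s

r_B, r_J, r_BJ, L_B, L_J = integrate()
```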



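By using Eqs. (39)-(49), the simultaneous differential equations Eqs. (34)-(38) can be solved
analytically as follows:

$$r_B = 1 - e^{-\eta_B t}, \tag{50}$$
$$r_J = 1 + \frac{\eta_B}{\eta_J - \eta_B} e^{-\eta_J t} - \frac{\eta_J}{\eta_J - \eta_B} e^{-\eta_B t}, \tag{51}$$
$$r_{BJ} = -\frac{D}{\eta_B \eta_J - \eta_B - \eta_J} + \frac{2\eta_J - \eta_B}{\eta_B - \eta_J} e^{-\eta_B t} + \frac{\eta_B}{\eta_J - \eta_B} e^{-\eta_J t} + \frac{\eta_J}{\eta_J - \eta_B} C e^{\eta_B(\eta_B - 2)t} + E e^{(\eta_B \eta_J - \eta_B - \eta_J)t}, \tag{52}$$
$$l_B^2 = 3 - C - 2 e^{-\eta_B t} + C e^{\eta_B(\eta_B - 2)t}, \tag{53}$$
$$l_J^2 = \frac{G}{2 - \eta_J} + \frac{2\eta_B}{\eta_J - \eta_B} e^{-\eta_J t} - \frac{2\eta_J}{\eta_J - \eta_B} e^{-\eta_B t} + \frac{F}{\eta_B(\eta_B - 2) - \eta_J(\eta_J - 2)} e^{\eta_B(\eta_B - 2)t} - \frac{2\eta_J E}{\eta_B - \eta_J} e^{(\eta_B \eta_J - \eta_B - \eta_J)t} + H e^{\eta_J(\eta_J - 2)t}, \tag{54}$$

where

$$C = 2 - \frac{\eta_B}{2 - \eta_B} (\sigma_A^2 + \sigma_B^2), \tag{55}$$
$$D = \eta_B (1 - \eta_J \sigma_B^2) + \eta_J (1 - \eta_B)(3 - C), \tag{56}$$
$$E = -\frac{\eta_B^2 \eta_J}{(\eta_J - \eta_B)(\eta_B \eta_J - \eta_B - \eta_J)} (\sigma_A^2 + \sigma_B^2) - \frac{2\eta_B}{\eta_J - \eta_B} + \frac{\eta_B (1 - \eta_J \sigma_B^2) + \eta_J (1 - \eta_B)}{\eta_B \eta_J - \eta_B - \eta_J}, \tag{57}$$
$$F = \frac{\eta_J^2 (\eta_B + \eta_J - 2)}{\eta_B - \eta_J} C, \tag{58}$$
$$G = \eta_J (3 + \sigma_B^2 + \sigma_J^2 - C) - \frac{2 (1 - \eta_J) D}{\eta_B \eta_J - \eta_B - \eta_J}, \tag{59}$$
$$H = 3 - \frac{F}{\eta_B(\eta_B - 2) - \eta_J(\eta_J - 2)} - \frac{G}{2 - \eta_J} + \frac{2\eta_J}{\eta_B - \eta_J} E. \tag{60}$$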
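The simplest of these solutions are easy to evaluate and to check against the initial conditions. The following is a minimal sketch that implements only $r_B$, $r_J$, and $l_B^2$, assuming the forms $r_B = 1 - e^{-\eta_B t}$, $r_J = 1 + \frac{\eta_B}{\eta_J - \eta_B} e^{-\eta_J t} - \frac{\eta_J}{\eta_J - \eta_B} e^{-\eta_B t}$, and $l_B^2 = 3 - C - 2e^{-\eta_B t} + C e^{\eta_B(\eta_B - 2)t}$ with $C = 2 - \frac{\eta_B}{2 - \eta_B}(\sigma_A^2 + \sigma_B^2)$. The function name and default parameters are illustrative, and the expressions require $\eta_B \neq \eta_J$ and $\eta_B \neq 2$.

```python
import math

def closed_form(t, eta_B=1.0, eta_J=0.3, var_A=0.2, var_B=0.3):
    """Return (r_B, r_J, l_B^2) at time t from the closed-form solutions."""
    # Constant C; assumes eta_B != 2.
    C = 2.0 - eta_B / (2.0 - eta_B) * (var_A + var_B)
    r_B = 1.0 - math.exp(-eta_B * t)
    # Assumes eta_B != eta_J.
    r_J = (1.0
           + eta_B / (eta_J - eta_B) * math.exp(-eta_J * t)
           - eta_J / (eta_J - eta_B) * math.exp(-eta_B * t))
    l_B_sq = (3.0 - C
              - 2.0 * math.exp(-eta_B * t)
              + C * math.exp(eta_B * (eta_B - 2.0) * t))
    return r_B, r_J, l_B_sq

start = closed_form(0.0)      # initial state: r_B = r_J = 0, l_B^2 = 1
late = closed_form(100.0)     # long-time values for 0 < eta_B, eta_J < 2
```

At $t = 0$ the three expressions reduce to the initial conditions $r_B = r_J = 0$ and $l_B^2 = 1$, which is a quick sanity check on the signs of the coefficients.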

4 Results and discussion 

The dynamical behaviors of the generalization errors $\epsilon_{Bg}$, $\epsilon_{Jg}$ and $\epsilon_{BJg}$ have
been obtained analytically from Eqs. (23), (26), (29), and (50)-(54). Figures 2 and 3 show the analytical
results and the corresponding simulation results, where $N = 10^3$. In the computer simulations,
$\epsilon_{Bg}$, $\epsilon_{Jg}$, and $\epsilon_{BJg}$ have been obtained by averaging the squared errors for
$10^4$ random inputs at each time step. The dynamical behaviors of $R$ and $l$ are shown in Figs. 4 and 5.
In these figures, the curves represent the theoretical results and the dots represent the simulation
results. Conditions other than $\eta_J$ are common: $\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and
$\sigma_J^2 = 0.4$. Figures 2 and 4 show the results in the case of $\eta_J = 1.2$. Figures 3 and 5 show the
results in the case of $\eta_J = 0.3$.
Figure 2: Generalization errors $\epsilon_{Jg}$, $\epsilon_{Bg}$, and $\epsilon_{BJg}$ in the case of $\eta_J = 1.2$.
Theory and computer simulation. Conditions other than $\eta_J$ are $\eta_B = 1.0$, $\sigma_A^2 = 0.2$,
$\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.

Figure 3: Generalization errors $\epsilon_{Jg}$, $\epsilon_{Bg}$, and $\epsilon_{BJg}$ in the case of $\eta_J = 0.3$.
Theory and computer simulation. Conditions other than $\eta_J$ are $\eta_B = 1.0$, $\sigma_A^2 = 0.2$,
$\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.



Figure 2 shows that the generalization error $\epsilon_{Jg}$ of the student is always larger than the
generalization error $\epsilon_{Bg}$ of the moving teacher when the learning rate of the student is
relatively large, such as $\eta_J = 1.2$. In addition, the mean $\epsilon_{BJg}$ of the error between the
moving teacher and the student is larger still than $\epsilon_{Jg}$. Figure 4 shows that the direction
cosine $R_J$ between the true teacher and the student is always smaller than the direction cosine $R_B$
between the true teacher and the moving teacher.
Figure 4: $R$ and $l$ in the case of $\eta_J = 1.2$. Theory and computer simulation. Conditions other than
$\eta_J$ are $\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.

Figure 5: $R$ and $l$ in the case of $\eta_J = 0.3$. Theory and computer simulation. Conditions other than
$\eta_J$ are $\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.

On the other hand, Fig. 3 shows the case where the learning rate of the student is relatively small, that
is, $\eta_J = 0.3$. Although the generalization error $\epsilon_{Jg}$ of the student is larger than the
generalization error $\epsilon_{Bg}$ of the moving teacher in the initial stage of learning, as in the case
of $\eta_J = 1.2$, the size relationship is reversed at $t = 4.4$, and after that $\epsilon_{Jg}$ is smaller
than $\epsilon_{Bg}$. This means that the performance of the student becomes higher than that of the
moving teacher. In regard to the direction cosine, Fig. 5 shows that though the direction cosine $R_J$
between the true teacher and the student is smaller than the direction cosine $R_B$ between the true
teacher and the moving teacher in the initial stage of learning, the size relationship is reversed at
$t = 5.2$, and after that, $R_J$ grows larger than $R_B$. This means that the student gets closer to the
true teacher than the moving teacher does, in spite of the student only observing the moving teacher.
The reason why the size relationship reverses at different times in Fig. 3 and Fig. 5 is that the
generalization error depends not only on the direction cosines $R_B$, $R_J$, and $R_{BJ}$ but also on the
lengths $l_B$ and $l_J$, as shown in Eqs. (23), (26), and (29), since linear perceptrons are treated and the
squared error is adopted as the error in this paper. In any case, these results show that the student can
reach a higher level of performance than the moving teacher, depending on the learning rate $\eta_J$ of
the student. This is a very interesting fact.

In addition, both Figs. 4 and 5 show that the direction cosine $R_{BJ}$ between the moving teacher and
the student takes a negative value in the initial stage of learning. That is, the angle between the moving
teacher and the student temporarily becomes larger than in the initial condition. This means that the
student initially falls behind the moving teacher. This is also an interesting phenomenon.

Figures 2-5 show that $\epsilon_{Bg}$, $\epsilon_{Jg}$, $\epsilon_{BJg}$, $R$, and $l$ seem to have almost reached a
steady state by $t = 20$. The macroscopic behaviors for $t \to \infty$ can be understood theoretically
since the order parameters have been obtained analytically. Focusing on the signs of the exponents of
the exponential functions in Eqs. (50)-(54), we can see that $\epsilon_{Bg}$ and $\epsilon_{BJg}$ diverge if
$0 > \eta_B$ or $\eta_B > 2$, and $\epsilon_{BJg}$ and $\epsilon_{Jg}$ diverge if $0 > \eta_J$ or $\eta_J > 2$.
The steady-state values of $\epsilon_{Bg}$, $\epsilon_{Jg}$, $\epsilon_{BJg}$, $R$, and $l$ in the case of
$0 < \eta_B, \eta_J < 2$ can be easily obtained by taking $t \to \infty$ in Eqs. (50)-(54). The relationships
obtained in this way between the learning rate $\eta_J$ of the student and $\epsilon_{Bg}$, $\epsilon_{Jg}$,
$\epsilon_{BJg}$, $R$, and $l$ are shown in Figs. 6, 7 and 8. The conditions other than $\eta_J$ are
$\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$, the same as in Figs. 2-5.
For the simulations, the values at $t = 50$ are plotted; these are considered to have already reached
a steady state.
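The steady state can also be computed without the full time-dependent solution, by setting the time derivatives of the order parameters to zero. The following is a minimal sketch under that assumption: the fixed-point expressions below were derived here from the order-parameter equations with gradient-descent $g$ and $f$, not copied from the paper's constants, and the function name and defaults are illustrative. It is only meaningful for $0 < \eta_B, \eta_J < 2$.

```python
def steady_state(eta_B=1.0, eta_J=0.3, var_A=0.2, var_B=0.3, var_J=0.4):
    """Fixed point of the order-parameter dynamics and the resulting errors."""
    a, b = eta_B, eta_J
    r_B = 1.0                       # from d r_B / dt = eta_B (1 - r_B) = 0
    r_J = r_B                       # from d r_J / dt = eta_J (r_B - r_J) = 0
    L_B = 1.0 + a * (var_A + var_B) / (2.0 - a)          # l_B^2
    r_BJ = (a * (1.0 - b * var_B) + b * (1.0 - a) * L_B) / (a + b - a * b)
    L_J = (2.0 * (1.0 - b) * r_BJ + b * (L_B + var_B + var_J)) / (2.0 - b)  # l_J^2
    return {
        "eps_Bg": 0.5 * (-2.0 * r_B + L_B + 1.0 + var_A + var_B),
        "eps_Jg": 0.5 * (-2.0 * r_J + L_J + 1.0 + var_A + var_J),
        "eps_BJg": 0.5 * (-2.0 * r_BJ + L_J + L_B + var_B + var_J),
        "R_B": r_B / L_B**0.5,
        "R_J": r_J / L_J**0.5,
        "l_B": L_B**0.5,
        "l_J": L_J**0.5,
    }

slow = steady_state(eta_J=0.3)   # small student learning rate
fast = steady_state(eta_J=1.2)   # large student learning rate
```

Comparing `slow["eps_Jg"]` and `fast["eps_Jg"]` with `eps_Bg` shows the qualitative effect discussed in the text: a slowly learning student averages out the moving teacher's fluctuations, whereas a fast one follows them.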

These figures show the following: though the steady-state generalization error of the student is larger
than that of the moving teacher if $\eta_J$ is larger than 0.58, the size relationship is reversed if
$\eta_J$ is smaller than 0.58. This means the student has a higher level of performance than the moving
teacher when $\eta_J$ is smaller than 0.58. In regard to the steady-state $R$ and $l$, the size
relationships are reversed at $\eta_J = 0.70$. In the limit of $\eta_J \to 0$, $l_J$ approaches unity,
$R_{BJ}$ approaches $R_B$, and $R_J$ approaches unity. That is, the student $J$ coincides with the true
teacher $A$ in both direction and length when $\eta_J \to 0$. Note that the reason why the generalization
error $\epsilon_{Jg}$ of the student is not zero in Fig. 6 is that independent noise is added to the true
teacher and the student. The phase transition in which $R_J$ and $R_{BJ}$ become zero and $l_J$,
$\epsilon_{BJg}$, and $\epsilon_{Jg}$ diverge at $\eta_J = 2$ is shown in Figs. 6-8.
Figure 6: Steady value of generalization errors $\epsilon_{Bg}$, $\epsilon_{Jg}$ and $\epsilon_{BJg}$. Theory and
computer simulation. Conditions other than $\eta_J$ are $\eta_B = 1.0$, $\sigma_A^2 = 0.2$,
$\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.

Figure 7: Steady value of $R$. Theory and computer simulation. Conditions other than $\eta_J$ are
$\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.

Figure 8: Steady value of $l$. Theory and computer simulation. Conditions other than $\eta_J$ are
$\eta_B = 1.0$, $\sigma_A^2 = 0.2$, $\sigma_B^2 = 0.3$, and $\sigma_J^2 = 0.4$.



5 Conclusion 

The generalization errors of a model composed of a true teacher, a moving teacher, and a student, all
of which are linear perceptrons with noise, have been obtained analytically using statistical mechanics.
It has been shown that the generalization error of the student can be smaller than that of the moving
teacher, even if the student uses only examples from the moving teacher.